Multi-Scale Spoken Document Retrieval for Cantonese Broadcast News
نویسندگان
چکیده
This paper presents the application of a multi-scale paradigm to Chinese spoken document retrieval (SDR) for improving retrieval performance. Multi-scale refers to the use of both words and subwords for retrieval. Words are basic units in a language that carry lexical meaning and subword units (such as phonemes, syllables or characters) are building components for words. Retrieval using subword indexing units is found to perform better than words because of the robustness of subword units to out-of-vocabulary (OOV) words during speech recognition and ambiguities in word segmentation. Experimental results have demonstrated that subword bigrams can bring improvement in retrieval performance over words (~9.56%). Application of multi-scale fusion to SDR aims at combining the lexical information of words and the robustness of subwords. This work presents the first detailed investigation for a Cantonese broadcast news retrieval task using two different multi-scale fusion approaches: pre-retrieval fusion and post-retrieval fusion. Multi-scale retrieval using both words and syllable bigrams achieve improvement in retrieval performance (~1.90%) over retrieval on the composite scales.
منابع مشابه
Document Expansion using a Side Collection for Monolingual and Cross-language Spoken Document Retrieval
This paper presents a method of document expansion using a side collection for improving the overall performance in retrieving spoken documents using text queries. This method is applied to Chinese spoken document retrieval (SDR) tasks where a series of experiments have been carried out for both monolingual and cross-language SDR systems. In our monolingual retrieval experiments, Cantonese broa...
متن کاملMulti-scale document expansion in English-Mandarin cross-language spoken document retrieval
This paper presents the application of document expansion using a side collection to a cross-language spoken document retrieval (CL-SDR) task to improve retrieval performance. Document expansion is applied to a series of EnglishMandarin CL-SDR experiments using selected retrieval models (probabilistic belief network, vector space model, and HMM-based retrieval model). English textual queries ar...
متن کاملMandarin-English Information (MEI): investigating translingual speech retrieval
This paper describes theMandarin–English Information (MEI) project, wherewe investigated the problemof cross-language spoken document retrieval (CL-SDR), and developed one of the first English–Chinese CL-SDR systems.Our systemaccepts an entireEnglish news story (text) asquery, and retrieves relevantChinese broadcast news stories (audio) from the document collection.Hence, this is a cross-langua...
متن کاملExperiments in syllable-based retrieval of broadcast news speech in Mandarin Chinese
Spoken document retrieval (SDR) has been extensively studied in recent years because of its potential use in navigating large multi-media collections in the near future. Considering the characteristics and monosyllabic structure of the Chinese language, the syllable-based indexing for retrieval of spoken documents in Mandarin Chinese has been investigated, and extensive experiments on retrieval...
متن کاملRecognition, indexing and retrieval of british broadcast news with the THISL system
This paper described the THISL spoken document retrieval system for British and North American Broadcast News. The system is based on the ABBOT large vocabulary speech recognizer and a probabilistic text retrieval system. We discuss the development of a realtime British English Broadcast News system, and its integration into a spoken document retrieval system. Detailed evaluation is performed u...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- I. J. Speech Technology
دوره 7 شماره
صفحات -
تاریخ انتشار 2004